Add network validation script executed in the sagemaker_ui_post_startup script #713

marfriaz · 2025-06-19T23:51:55Z

Description

This change introduces a network validation script that tests whether certain AWS services are reachable by making read-only API calls with a set timeout. If a call exceeds the timeout, the script infers a network misconfiguration, such as having no access to the internet or to a VPC endpoint for that service. A call that resolves (succeeds or fails) within the timeout is taken as evidence that the network is set up correctly.

The script checks the AWS services used for Compute Connections and Git. More specifically, it lists the DataZone connections to determine which services need to be checked.

Unreachable services are aggregated and written to post-startup-status.json, which surfaces an error notification in the IDE.
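The timeout-based inference described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function names, the "Name=command" argument format, and the 5-second default are assumptions, and it relies on GNU `timeout` returning exit status 124 when it cuts a command off.

```shell
# Sketch only: a command that finishes within the timeout (success or
# failure) counts as reachable; one killed by `timeout` does not.
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-5}"   # assumed default, not from the PR

is_reachable() {
  status=0
  timeout "$TIMEOUT_SECONDS" "$@" > /dev/null 2>&1 || status=$?
  # GNU timeout exits with 124 when the time limit was hit.
  [ "$status" -ne 124 ]
}

# Prints the name of every service whose check timed out, one per line.
# Each argument is a hypothetical "Name=command" pair.
list_unreachable() {
  for entry in "$@"; do
    name=${entry%%=*}
    command_line=${entry#*=}
    # Word splitting of the stored command string is intentional here.
    if ! is_reachable $command_line; then
      printf '%s\n' "$name"
    fi
  done
}

# Example call (not executed here):
#   list_unreachable "STS=aws sts get-caller-identity"
```

Treating every non-timeout exit as "reachable" means a 4xx permission error still counts as proof of connectivity, which matches the inference in the description.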

Testing

Tested in a SMUS portal under three network setups: with internet access, without internet access, and without internet access but with VPC endpoints for DataZone and S3.

Type of Change

Release Information

Does this change need to be included in patch version releases? By default, pull requests are only added to the next SMD image minor version release once they are merged into the template folder. Only critical bug fixes or security updates should be applied to new patch versions of existing image minor versions.

Yes (Critical bug fix or security update)
No (New feature or non-critical change)
N/A (Not an image update)

Checklist:

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works

reganbaum · 2025-06-21T01:47:03Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+# Initialize SERVICE_COMMANDS with always-needed STS and S3 checks
+declare -A SERVICE_COMMANDS=(
+  ["STS"]="aws sts get-caller-identity"
+  ["S3"]="aws s3api list-buckets --max-items 1"


Can we use list-objects at the project S3 path instead? list-buckets will result in a 4xx error because the project role doesn't have permission to list buckets by default. While I understand the goal here is to check which services are reachable, it would be better to invoke an API we expect to succeed so logs aren't polluted and 4xx metrics aren't impacted.

Good point, updating
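One way the suggested replacement could look is sketched below. It only builds the command string from an `s3://` URI; the helper name and the idea that the project S3 path is available as a URI are assumptions, and the actual update in the PR may differ.

```shell
# Sketch only: print an s3api command that lists at most one object under
# a given s3://bucket/prefix URI, instead of calling list-buckets.
s3_check_command() {
  uri=${1#s3://}          # strip the scheme
  bucket=${uri%%/*}       # everything before the first slash
  case "$uri" in
    */*) prefix=${uri#*/} ;;
    *)   prefix="" ;;
  esac
  if [ -n "$prefix" ]; then
    printf 'aws s3api list-objects-v2 --bucket %s --prefix %s --max-items 1\n' "$bucket" "$prefix"
  else
    printf 'aws s3api list-objects-v2 --bucket %s --max-items 1\n' "$bucket"
  fi
}

# Example call with a placeholder URI:
#   s3_check_command "s3://example-bucket/domain/project/"
```

Scoping the call to the project prefix keeps the check on a path the project role is expected to be able to read, so the probe should not add 4xx noise.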

reganbaum · 2025-06-21T01:48:32Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+  if [[ "$type" == "SPARK" ]]; then
+    # If sparkGlueProperties present, add Glue check
+    if echo "$item" | jq -e '.props.sparkGlueProperties' > /dev/null; then
+      SERVICE_COMMANDS["Glue"]="aws glue get-crawlers --max-items 1"


Same comment here, can we use glue get-catalogs or get-databases instead?

reganbaum · 2025-06-21T01:52:45Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+    # Check for emr-serverless in sparkEmrProperties.computeArn for EMR Serverless check
+    emr_arn=$(echo "$item" | jq -r '.props.sparkEmrProperties.computeArn // empty')
+    if [[ "$emr_arn" == *"emr-serverless"* ]]; then
+      SERVICE_COMMANDS["EMR Serverless"]="aws emr-serverless list-applications --max-results 1"


Same thing here. Can we use get-application or another emr-s API that the project role does have permission to call?
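A possible shape for the get-application variant is sketched below. It assumes the compute ARN ends in `/applications/<application-id>`, so the ID can be taken from the last path segment, and it assumes the project role is allowed to call `get-application` on that application; neither is confirmed in this thread. The helper name is hypothetical.

```shell
# Sketch only: derive a get-application check from an EMR Serverless ARN.
# Returns non-zero (and prints nothing) for ARNs of any other service.
emr_serverless_check_command() {
  arn=$1
  case "$arn" in
    *emr-serverless*)
      application_id=${arn##*/}   # text after the last slash
      printf 'aws emr-serverless get-application --application-id %s\n' "$application_id"
      ;;
    *)
      return 1
      ;;
  esac
}
```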

reganbaum · 2025-06-21T01:55:22Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+
+# Optionally add CodeConnections if S3 storage flag is true (Git storage)
+if [[ "$is_s3_storage" == "1" ]]; then
+  SERVICE_COMMANDS["CodeConnections"]="aws codeconnections list-hosts --max-results 1"


I haven't seen the managed policy updates for S3 storage; will those include this API?

Updating to

# If using Git Storage (S3 storage flag == 1), check CodeConnections connectivity
# Domain Execution role contains permissions for CodeConnections
if [[ "$is_s3_storage" == "1" ]]; then
  SERVICE_COMMANDS["CodeConnections"]="aws codeconnections list-connections --max-results 1 --profile DomainExecutionRoleCreds"
fi

andychoquette · 2025-06-23T15:55:03Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+network_validation_file=${2:-"/tmp/.network_validation.json"}
+
+# Function to write unreachable services to a JSON file
+write_unreachable_services_to_file() {


How does this interact with the current post startup notifications file? For example, if this script throws an error, and the post startup script also throws an error, what is the user experience?

Thinking out loud here, but shouldn't we replace the network configuration error message we're writing in the post-startup script here: https://github.com/aws/sagemaker-distribution/blob/main/template/v3/dirs/etc/sagemaker-ui/sagemaker_ui_post_startup.sh#L102? - the checks in this CR seem to be a more robust solution.

write_unreachable_services_to_file writes to "/tmp/.network_validation.json", which the sagemaker_ui_post_startup script then reads. If the file lists any unreachable services, we write the network configuration error message.

You can view this logic in my changes to template/v2/dirs/etc/sagemaker-ui/sagemaker_ui_post_startup.sh
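The hand-off through the JSON file might look roughly like this. The JSON key `UnreachableServices`, both function names, and the file layout are hypothetical, since the thread does not show the real format; the reader relies on `jq`, which the validation script already uses.

```shell
# Sketch only: the validation script writes the unreachable services,
# and the post-startup script reads them back.

# Usage: write_unreachable_services FILE [SERVICE...]
# Assumes service names contain no characters that need JSON escaping.
write_unreachable_services() {
  output_file=$1
  shift
  json_items=""
  for service in "$@"; do
    if [ -n "$json_items" ]; then
      json_items="$json_items, "
    fi
    json_items="$json_items\"$service\""
  done
  printf '{"UnreachableServices": [%s]}\n' "$json_items" > "$output_file"
}

# Prints the services as a comma-separated string (empty if none).
read_unreachable_services() {
  jq -r '.UnreachableServices | join(", ")' "$1"
}
```

With this layout the post-startup script only needs to test whether the string returned by the reader is non-empty before writing the error status.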

My understanding is that the error message that's being added in this PR will override the more generic error on L102 that Andy linked to, since on L219 in the post startup script in this PR, the more specific error message is written to the post startup notifications file:

write_status_to_file "error" "$error_message"

Or are we missing something else Andy? Does the post-startup script stop once an error is written to the notifications file? AFAIK it wouldn't, but that's not to say that another error could cause the script to exit before reaching the network validation script being added in this PR (but in that case, another error message would go to the notifications file).

To summarize how my change interacts with the current post-startup notifications:

1. First, the post-startup script checks whether DataZone is accessible. If not, an error is thrown, the script times out, and the next steps are skipped. DataZone needs to be accessible for the newly introduced network_validation script to run, because we rely on the is_s3_storage flag and DomainExecutionRoleCreds, both of which require DataZone connectivity.

2. If DataZone connectivity is successful, in JupyterLab, a workflows script is run. If that script fails, a generic error is shown: "Please stop and restart your space to retry."

3. Lastly, the network_validation script is run; any error it reports overrides past errors and is displayed in the UI.

For 1., we can combine these checks in the network validation script, but IMO the downsides of that are:

- DataZone connectivity is an indication of whether the user's domain is using a public subnet, according to the error message it throws.
- It would involve duplicating the logic to get the is_s3_storage flag and DomainExecutionRoleCreds in both the network_validation script and the sagemaker_ui_post_startup script, since they are used in the workflows script as well.
- The DataZone check is a prereq for the network_validation script.
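The ordering in that summary can be sketched as below, under the assumption that each error simply overwrites the status file so the last write wins. `check_datazone`, `run_workflows_setup`, and `run_network_validation` are hypothetical stand-ins that the caller must define, and this `write_status_to_file` body and its JSON layout are placeholders, not the real implementation.

```shell
# Sketch only: later errors overwrite earlier ones in the status file.
STATUS_FILE="${STATUS_FILE:-/tmp/post-startup-status-sketch.json}"  # assumed path

write_status_to_file() {
  printf '{"status": "%s", "message": "%s"}\n' "$1" "$2" > "$STATUS_FILE"
}

run_post_startup() {
  # Step 1: DataZone is a prerequisite; stop early if it is unreachable.
  if ! check_datazone; then
    write_status_to_file "error" "DataZone is not reachable."
    return 1
  fi
  # Step 2: a workflows failure writes a generic error but does not stop.
  if ! run_workflows_setup; then
    write_status_to_file "error" "Please stop and restart your space to retry."
  fi
  # Step 3: network validation runs last, so its error replaces any earlier one.
  unreachable=$(run_network_validation)
  if [ -n "$unreachable" ]; then
    write_status_to_file "error" "Unreachable services: $unreachable"
  fi
}
```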

thanks for clarifying - this makes sense.

reganbaum · 2025-06-24T19:28:03Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+network_validation_file=${2:-"/tmp/.network_validation.json"}
+
+# Function to write unreachable services to a JSON file
+write_unreachable_services_to_file() {


My understanding is that the error message that's being added in this PR will override the more generic error on L102 that Andy linked to, since on L219 in the post startup script in this PR, the more specific error message is written to the post startup notifications file:

write_status_to_file "error" "$error_message"

Or are we missing something else Andy? Does the post-startup script stop once an error is written to the notifications file? AFAIK it wouldn't, but that's not to say that another error could cause the script to exit before reaching the network validation script being added in this PR (but in that case, another error message would go to the notifications file).

andychoquette · 2025-06-25T22:36:05Z

template/v2/dirs/etc/sagemaker-ui/network_validation.sh

+network_validation_file=${2:-"/tmp/.network_validation.json"}
+
+# Function to write unreachable services to a JSON file
+write_unreachable_services_to_file() {


thanks for clarifying - this makes sense.

marfriaz requested a review from a team as a code owner June 19, 2025 23:51

marfriaz force-pushed the main branch 2 times, most recently from 7f52534 to bd4f0c3 Compare June 20, 2025 00:20

reganbaum reviewed Jun 21, 2025

andychoquette reviewed Jun 23, 2025

marfriaz force-pushed the main branch from bd4f0c3 to d20a276 Compare June 23, 2025 21:07

marfriaz force-pushed the main branch from d20a276 to 1b016c1 Compare June 23, 2025 22:30

reganbaum approved these changes Jun 24, 2025

andychoquette approved these changes Jun 25, 2025

aws-tianquaw approved these changes Jun 26, 2025

marfriaz commented Jun 19, 2025 · edited by PotatoWKY